library(kableExtra)
library(tidyverse)
library(dplyr)
[LRJ: Once we finalize the case study, we should come back and revisit the intro/motivation section]
There are over 1400 public schools in Maryland and there are also outstanding students in each school. The federal Every Student Succeeds Act (ESSA) prompted states to develop long term plans to improve schools through accountability and innovation, which sets Maryland’s schools on the path to continuous improvement. Maryland Report Card website aims to share the most current information available to help stakeholders understand and measure student achievement in all 24 local school systems. Here is a message from the State Superintendent of Schools that enables you to know more about how it functions.
Our analysis is inspired by this plan - figure out how well schools were performing and how different factors influenced their performance. Once we indentify schools that need improvements and influential factors, both Maryland’s government and local organizations can prompt actions and provide necessary support, in a way that is understanable and reliable.
The libraries used in this study are listed in the following table, along with their purpose in this particular case study:
| Library | Purpose |
|---|---|
kableExtra |
Helps with building common complex tables and manipulating table styles; creates awesome HTML tables |
tidyverse |
A coherent system of packages for data manipulation, exploration and visualization |
dplyr |
Helps you solve the most common data manipulation challenges |
In order to run this code please ensure you have these packages installed.
The learning objectives for this case study include:
The data files were downloaded from the Maryland Report Card website. Data files can be found by clicking Data Download at the bottom of the web page.
At this link, you can find data for all 1400+ schools in Maryland from 2003 to 2018. For this case study, we will focus on data from 2017. In particular, we will extract data from 5 of the files for that year:
After downloading the 5 data files, we can write a script to import all the files in the same folder into R relatively easily. In this case, we’ve saved all files in a folder named data2017:
Knowing which environment you are in is necessary since it enables R to find these files quickly and accurately. Since our data files are in the folder data2017, we will need to specify this in the path to each file.
Also, instead of importing each file with an individual line of code, we can use the function lapply() to import them all simultaneously. To do this, first we make a list of all of the files names for importing. The list.files() function will make a list of all files in a folder that match a given pattern. Here we want all .csv files, so we can use pattern="*.csv" to specify all files that end with a .csv. The * character means any set of characters can be in that location in the file name.
files = list.files(path='./data2017',pattern="*.csv")
files
## [1] "Cohort_Grad_Rate_2017.csv" "PARCC_2017.csv"
## [3] "Special_Services_2017.csv" "Student_Mobility_2017.csv"
## [5] "Wealth_Expenditures_Data_2017.csv"
Next we can add the folder name, data2017, to the path for each file:
filePaths = paste0("./data2017/", files)
filePaths
## [1] "./data2017/Cohort_Grad_Rate_2017.csv"
## [2] "./data2017/PARCC_2017.csv"
## [3] "./data2017/Special_Services_2017.csv"
## [4] "./data2017/Student_Mobility_2017.csv"
## [5] "./data2017/Wealth_Expenditures_Data_2017.csv"
Finally, we can read in each data file using lapply():
data <- lapply(filePaths, function(x){
read.csv(x, header=T)})
list.files(pattern="*.csv") tries to find all the files whose names ending with ‘csv’, and stores the file location information into files. Function lapply() reads the files one by one and saves in list data.
Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham
This section includes large amount of data wrangling such as gather, merge, mutate etc. Each step requires a specific tool to complete, and we would like to show the dataset again and again everytime we apply a specific tool. RStudio prodives a data viewer that enables you to look inside data frames and other rectangular data structures. You can invoke the viewer in a console by calling the View() function on the data frame you want to look at.
The first file (Cohort_Grad_Rate_2017.csv) records the percentage of students who received a Maryland high school diploma during 2017. Before we extract this variable, let’s have a qucik look of graduation rate file:
View(data[[1]])
There are totally 10 variables and 579 observations in it. From above, in the original dataset, there are several problems need to be fixed, such as missing value. We will deal with them one by one in later steps.
The first step is to get school level data. We notice that if School.Number equals to ‘A’, then this row corresponds to county level data. Also, there are two kinds of graduation rate:
Function filter() and select() in package dplyr will help us to finish the first wrangling step.
df_grad <- data[[1]] %>%
filter( School.Number != 'A',
Cohort == '4-year adjusted cohort') %>%
select(LEA.Name, School.Number,
School.Name, Grad.Rate)
colnames(df_grad) <- c('county', 'sch.num',
'sch', 'grad.rate')
See what we get! Only school level information and useful variables are kept, leaving us a neater dataset. Next, missing values should be fixed.
There are a lot of missing values in the dataset, which would influence further analysis if we leave them there. From the first plot, missing values exist in different forms:
'>= 95.00','<= 5.00' : The graduation rate is higher than 95% or lower than 5%.'*' : This schools’ graduation rate is simply not present in the data.Use function summary() to produce the descriptive statistics of these missing values.
summary(df_grad[2:4])
## sch.num sch
## 301 : 4 Chesapeake High : 2
## 102 : 3 Frederick Douglass High : 2
## 207 : 3 Northwestern High : 2
## 405 : 3 Aberdeen High : 1
## 705 : 3 Academy for College and Career Exploration: 1
## 108 : 2 Academy of Health Sciences at PGCC : 1
## (Other):246 (Other) :255
## grad.rate
## >= 95.00: 57
## * : 27
## <= 5.00 : 9
## 84.02 : 2
## 84.4 : 2
## 86.21 : 2
## (Other) :165
Pay attention to the first kind of missing value. It is impossible to calculate the exact graduate rate because of the lack of information, so we decide to use gsub() function to replace incomplete values. gsub() function replaces all matches of a string. Elements of string vectors which are not substituted will be returned unchanged.
2.5 and 97.5 are used to replace ‘<= 5.00’ and ‘>= 95.00’ respectively.
df_grad$grad.rate<- gsub('<= 5.00','2.5', df_grad$grad.rate)
df_grad$grad.rate <- gsub('>= 95.00','97.5', df_grad$grad.rate)
The second type of missing value (’*’) is confusing, let’s extract them out and have a look:
head(df_grad[df_grad$grad.rate == '*', ])
## county sch.num sch grad.rate
## 6 Anne Arundel 1274 Marley Glen School *
## 16 Anne Arundel 3414 Ruth Parker Eason School *
## 21 Anne Arundel 4304 Central Special School *
## 27 Baltimore County 75 Crossroads Center *
## 29 Baltimore County 111 Maiden Choice School *
## 41 Baltimore County 922 Ridge Ruxton *
If we search Marley Glen School, we may found it mainly provides program for students with disabilities. Also, Ruth Parker Eason School provides a special education program for students with moderate to severe disabilities. So we guess schools with ’*‘graduation rate mainly provide education for students with moderate to severe disabilities. We can either delete it or replace them as ’NA’. Considering we need to merge multiple files in later steps, such kind of information may be dropped out automatically. Ths most appropriate method for us is replacing them as ‘NA’ and deal with them all together.
df_grad$grad.rate <- na_if(df_grad$grad.rate, '*')
Our last step is to extract information from several files and merge them together, requiring a key that specifies each observation uniquely. Possible choices in present dataset can be sch.num or sch.name but we need to check whether each of them appear once in the dataset. If not, they are not suitable choice because of the lack of uniqueness.
summary(df_grad[,2:3])
## sch.num sch
## 301 : 4 Chesapeake High : 2
## 102 : 3 Frederick Douglass High : 2
## 207 : 3 Northwestern High : 2
## 405 : 3 Aberdeen High : 1
## 705 : 3 Academy for College and Career Exploration: 1
## 108 : 2 Academy of Health Sciences at PGCC : 1
## (Other):246 (Other) :255
From the summary, whether sch.num or sch.name, the frequency of some values is larger than 1. Therefore it is definitely important to define a unique element by ourselves.
df_grad <- df_grad %>%
within( id <- paste(county, sch, sep = '-'))
The combination of county and sch generates a new variable - id. To check its uniqueness, we use function table(), which builds a contingency table of the counts at each combination of factor levels. The as.data.frame() function converts the array-based representation of a contingency table to a data frame containing two variable: each factor Var1and its frequency Freq.
tab <- as.data.frame(table(df_grad$id))
tab[tab$Freq > 1,]
## [1] Var1 Freq
## <0 rows> (or 0-length row.names)
Up to now, a unique key element id is generated and we finally create the ideal dataset - a tidy dataset.
Partnership for Assessment of Readiness for College and Careers (PARCC) reflects schools’ academic achievements through assessing the performance of students on state standardized tests, such as English and Math. It not only provides information about students mastery of state standards, but also offers teachers and parents with timely information to inform instruction and how to provide support. From this file, we want to extract the percentage of students scoring ‘high performance’, which works as the target variable is data analysis part.
colnames(data[[2]])
## [1] "Academic.Year"
## [2] "LEA.Number"
## [3] "LEA.Name"
## [4] "School.Number"
## [5] "School.Name"
## [6] "Assessment"
## [7] "Tested.Count"
## [8] "Level.1..Did.not.yet.meet.expectations...Count"
## [9] "Level.1..Did.not.yet.meet.expectations...Percent"
## [10] "Level.2..Partially.met.expectations...Count"
## [11] "Level.2..Partially.met.expectations...Percent"
## [12] "Level.3..Approached.expectations...Count"
## [13] "Level.3..Approached.expectations...Percent"
## [14] "Level.4..Met.expectations...Count"
## [15] "Level.4..Met.expectations...Percent"
## [16] "Level.5..Exceeded.expectations...Count"
## [17] "Level.5..Exceeded.expectations...Percent"
## [18] "Create.Date"
PARCC is a large file with 8989 observations and 18 variables. We notice there are five levels of performance indicators, ranging from Level 1 to Level 5. First, we use percentage information rather than count information. Second, we will create a new variable, pro, that indicates the percentage of students performing at the ‘met expectations’ and ‘exceeded expectations’ levels.
Similiar to graduation rate file, only school level information and necessary variables will be kept and renamed.
df_parcc <- data[[2]] %>%
filter( School.Number != 'A' ) %>%
select(3,4,5,6,9,11,13,15,17)
colnames(df_parcc) <- c('county','sch.num',
'sch', 'subject','L1',
'L2','L3','L4','L5')
Different with graduation rate file that just contains high school information, PARCC file also includes information of middle schools and elementary school. We would leave them there ?????
Variable subject reflects the kind of subject test. English/Language Arts Grade 10, Algebra 1 and 2 are filtered out by function filter() in package dplyr. You are encouraged to keep other assessments that you are interested in.
df_parcc <- df_parcc %>%
filter(subject %in% c('English/Language Arts Grade 10','Algebra 1','Algebra 2'))
Missing value is the toughest problem in this wrangling part. Let’s see its summary first.
summary(df_parcc[, 5:9])
## L1 L2 L3 L4 L5
## <= 5.0 :273 <= 5.0 :176 <= 5.0 :127 <= 5.0 :151 <= 5.0 :597
## 17.6 : 7 28.6 : 9 33.3 : 10 33.3 : 7 6.8 : 7
## 5.5 : 7 15.2 : 8 22.7 : 8 75 : 7 19.8 : 5
## 15.8 : 6 16.7 : 7 25 : 8 11.4 : 5 5.6 : 5
## 5.6 : 6 17.9 : 7 14.3 : 7 17.6 : 5 5.1 : 4
## 6.7 : 6 18.2 : 7 19.7 : 7 34.1 : 5 8.6 : 4
## (Other):569 (Other):660 (Other):707 (Other):694 (Other):252
There is only one kind of missing value: <=5.0 in these 5 levels indicators. We will replace it as NA, then transform them to numerical data.
df_parcc[,5:9] <-
lapply(df_parcc[,5:9], function(x) as.numeric(as.character(x)))
We only care about the proportion of L4 and L5 but NAs are distributed differently. To simply wrangling steps, we reduce these five levels into two new levels : ‘Level weak’ (L1, L2 and L3) and ‘Level excellent’ (L4 and L5). Then deal with NAs in three ways. If values of ‘Level excellent’ are complete, then pro equals to sum of L4 and L5. Such as Algebra 1 assessment of ‘Washington Middle’, its pro should be \(85.5 + 9.7 = 95.2\); If values of ‘Level weak’ all exist, pro equals to 100 percent minus the sum of L1, L2 and L3. For example, the pro of Algebra 1 for ‘Fort Hill High’ equals to \(100-23.9-44.6-21.7 = 9.8\);
df_parcc$pro <- NA
indx1 <- !is.na(df_parcc$L4) & !is.na(df_parcc$L5)
df_parcc[indx1, 'pro'] <- rowSums(df_parcc[indx1,8:9])
sum(is.na(df_parcc$pro))
## [1] 597
indx2 <- !is.na(df_parcc$L1) & !is.na(df_parcc$L2) & !is.na(df_parcc$L3)
df_parcc[indx2, 'pro'] <- 100-rowSums(df_parcc[indx2,5:7])
sum(is.na(df_parcc$pro))
## [1] 203
Function is.na() indicates which elements are missing and returns a boolean index of the same shape as the original data frame. df_parcc[indx1,] and df_parcc[indx2,] specify observations have complete information of ‘Level excellent’ and ‘Level weak’ respectively. sum(is.na(df_parcc$pro)) shows the counts of NA in variable pro. Undoubtedly, our methods reduce NAs greatly after we apply corresponding methods mentioned above.
If missing values exist both in these two levels, we will first use 100 percent minus existing values. Then divide the difference to the number of missing value and replace NA with this quotient. For instance - ‘Brooklyn Park Middle’, there are three NAs, so \(NA = (100-20.8-72.7)/3=2.2\), and pro equals to \(72.7+2.2=74.9\). The reason we don’t replace these NAs as 2.5 directly is that there is a restriction we don’t want to obey - the sum of 5 levels equals to 100 percent. It is more reasonable to assume missing values are uniformly distributed than replace them with a fixed value directly.
indx3 <- is.na(df_parcc$pro)
na_sum <- 100-rowSums(df_parcc[indx3, 5:9], na.rm = TRUE)
n <- rowSums(is.na(df_parcc[indx3, 5:9]))
na <- round(na_sum/n,1)
for (i in 1:length(indx3)){
df_parcc[indx3, 5:9][i,which(is.na(df_parcc[indx3, 5:9][i,]))] <- na[i]
}
df_parcc[indx3, 'pro'] <- rowSums(df_parcc[indx3,8:9])
sum(is.na(df_parcc$pro))
## [1] 0
What the above script does? First, rows containing NA in pro are filtered out and recorded as indx3, totally 874 observations. Then, for each row, we calculate the difference of 100 percent and sum of existing values - na_sum, and the number of missing value - n. Dividing na_sum by n gives the quotients na. All of these three variables are vectors which length are the same as indx3 - 203. The last step is replacement and we write a for loop here. For each row (df_parcc[indx3, 5:9][i,]), columns containing NA are selected out with the help of function is.na(). Applying function which() enables us to get the corresponding columns indexs. Once we determine the positions of NA, we can assign what we calculated before - na to these NAs.
Now, we have 0 NAs, which proves the validity and feasibility of our methods. The following step is to create a unique id - still use the combination of county name and school name. And we only keep new variable pro in the final dataset
pac <- df_parcc %>%
within( id <- paste(county, sch, sep = '-')) %>%
select(1,2,3,4,10,11)
This is how the final parcc dataset looks like:
The proportion of high performance in these assessment (\(p\)) plays an role of response variable in our later analysis. Thus, splicting them out first makes further analysis convenient.
\[ p_{ela} = \beta_0+\beta_{1} x_1+ \beta_{2} x_2 \]
df_ela <- pac[grep("English/Language Arts Grade 10", pac$subject), ]
df_alg1 <- pac[grep("Algebra 1", pac$subject), ]
df_alg2 <- pac[grep("Algebra 2", pac$subject), ]
Special_Services_2017.csv records the number and percentage of students who applied different special service and students approved through direct certification. Such variables make us know more about the school quality and students support. Here we choose service FARMS: receive free or reduced price meals.
What the following script does is similiar to previous wrangling steps except one step. Since in special service file there is a vairable called School.Type, we can filter high school information easily if we add a new condition School.Type == 'High'.
df_spc <- data[[3]] %>%
filter( School.Number != 'A', School.Type == 'High' ) %>%
within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
select(3,4,5,9,21)
colnames(df_spc) <- c('county', 'sch.num', 'sch', 'farms', 'id')
summary(df_spc)
## county sch.num
## Baltimore City :41 0301 : 4
## Prince George's :37 0102 : 3
## Baltimore County:33 0207 : 3
## Montgomery :31 0405 : 3
## Anne Arundel :18 0705 : 3
## Howard :14 0108 : 2
## (Other) :94 (Other):250
## sch farms
## Chesapeake High : 2 * : 12
## Frederick Douglass High : 2 <= 5.0 : 7
## Northwestern High : 2 35.1 : 3
## Aberdeen High : 1 55.4 : 3
## Academy for College and Career Exploration: 1 12.5 : 2
## Academy of Health Sciences at PGCC : 1 13.8 : 2
## (Other) :259 (Other):239
## id
## Length:268
## Class :character
## Mode :character
##
##
##
##
There is only one problem in this dataset, also it is an old problem - missing value.
Similiarly, let’s have a look on what these schools with ’*’ in farms look like:
df_spc[df_spc$farms == '*',]
## county sch.num sch farms
## 6 Anne Arundel 1274 Marley Glen School *
## 10 Anne Arundel 2233 Anne Arundel Evening High *
## 28 Baltimore County 0077 BCDC Educational Center *
## 57 Calvert 0206 Calvert Country School *
## 70 Carroll 0712 Carroll Springs School *
## 72 Carroll 0717 Post Secondary Program *
## 123 Howard 0522 Cedar Lane Special Center *
## 156 Montgomery 0799 Stephen Knolls School *
## 159 Montgomery 0951 Longview School *
## 196 Prince George's 2217 Incarcerated Youth Center (JACS) *
## 220 Wicomico 0520 Wicomico County Evening High *
## 267 Baltimore City 0884 Eager Street Academy *
## id
## 6 Anne Arundel-Marley Glen School
## 10 Anne Arundel-Anne Arundel Evening High
## 28 Baltimore County-BCDC Educational Center
## 57 Calvert-Calvert Country School
## 70 Carroll-Carroll Springs School
## 72 Carroll-Post Secondary Program
## 123 Howard-Cedar Lane Special Center
## 156 Montgomery-Stephen Knolls School
## 159 Montgomery-Longview School
## 196 Prince George's-Incarcerated Youth Center (JACS)
## 220 Wicomico-Wicomico County Evening High
## 267 Baltimore City-Eager Street Academy
It seems that they are still not ‘normal’ public schools so we change ’*’ to NA.
After replacing ‘<= 5.0’ to ‘2.5’ with function gsub(), if you apply as.numeric() function, ’*’ will be transformed to NA automatically.
df_spc$farms <- gsub('<= 5.0','2.5', df_spc$farms)
df_spc$farms <- as.numeric(df_spc$farms)
This is how the final special serivce dataset looks like:
Student_Mobility_2017.csv includes information of the movement of students from one school to another during the school year. There are 3 types of mobility provided: total student mobility, entry mobility, and exit mobility. To some extent, exit mobility reflects students’ satisfaction with their schools so we would like to use it as variable withdraw in our analysis.
As usual, wrangling steps are high school observation selection, unique id creation, feature selection and missing value transformation.
df_mob <- data[[4]] %>%
filter( School.Number != 'A', School.Type == 'High' )%>%
within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
select(3,4,5,11,15)
colnames(df_mob) <- c('county', 'sch.num', 'sch', 'withdraw', 'id')
summary(df_mob)
## county sch.num
## Baltimore City :41 301 : 4
## Prince George's :37 102 : 3
## Baltimore County:33 207 : 3
## Montgomery :31 405 : 3
## Anne Arundel :18 705 : 3
## Howard :14 108 : 2
## (Other) :94 (Other):250
## sch withdraw
## Chesapeake High : 2 <= 5.0 : 77
## Frederick Douglass High : 2 >= 95.0: 9
## Northwestern High : 2 5.5 : 8
## Aberdeen High : 1 8.6 : 5
## Academy for College and Career Exploration: 1 10.3 : 3
## Academy of Health Sciences at PGCC : 1 10.9 : 3
## (Other) :259 (Other):163
## id
## Length:268
## Class :character
## Mode :character
##
##
##
##
df_mob$withdraw <- gsub('<= 5.00','2.5', df_mob$withdraw)
df_mob$withdraw <- gsub('>= 95.00','97.5', df_mob$withdraw)
df_mob$withdraw <- as.numeric(df_mob$withdraw)
This is how the final mobility dataset looks like:
Wealth_Expenditures_Data_2017.csv contains information of different kinds of investments as well as wealth per pupil. What makes it special is that its size (11 variables and 25 observations) is obsivously smaller than previous files. Let’s have a look on it:
It turns out that this file only contains county level data! So we would like to keep variable county which will work as the key in later merge steps. Besides, a new variable exp will be created by the formula:
\[exp = \frac{Wealth.Per.Pupil}{Expenditures.Per.Pupil}\]
The quotient of Wealth.Per.Pupil and Expenditures.Per.Pupil captures the image of each county’s financial condition. Function mutate() in package dplyr is a nice tool to adds new variables. Thus, wrangling steps are new variable creation, feature selection and rename.
df_exp <- data[[5]] %>%
mutate(exp = round(Wealth.Per.Pupil/Expenditures.Per.Pupil, 1)) %>%
select(3,12)
colnames(df_exp) <- c('county','exp')
We notice the last row is information about ‘All Public School’, but it will be dropped automatically when we merge it.
Now we have totally 8 datasets: | Dataset | Size |
|———-|———–| | df_grad |2645| | df_mob |2685| | df_spc |2685| | df_exp |252| | df_ela |2346| | df_alg1 |4736| | df_alg2 |167*6|
We have many datasets, requiring combination to answer the questions that we are interested in. They are called relational data because it is their relations, not just the individual datasets, that are important. Relations are defined between a pair in these datasets. In out case, relations presented in the form that the same variable exists in multiple datasets. The following plot describes how our datasets connect:
We can see that variable county exists in all datasets, variable sch, sch.name and id exists in all datasets except df_exp. In addition, each dataset contain its unique variables.
To work with relational data, mutating joins is pretty powerful which combines variables from two datasets. It first matches observations by their keys, then copies across variables from one dataset to the other. keys is a variable that uniquely indentifies an observation so it can connect each pair of datasets. This is the meaning of the existence of variable id - sufficiently indentifies each observation in our datasets.
mutating joins includes inner joins and outer joins. An inner join keeps observations that appear in both datasets while outer join keeps observations that appear in at least one of the datasets. There are three types of outer joins: left join, right join and full join. The graphical explanation is given below:
Inner join:
Outer join:
To be more specific, a table is provided to show detailed descriptions.
| Types | How it works | Implementation in dplyr |
|---|---|---|
| inner join | keeps observations that appear in both datasets | inner_join() |
| left join | keeps all observations in x but drops bbservations in y but not in x |
left_join() |
| right join | keeps all observations in y but drops bbservations in x but not in y |
right_join() |
| full join | keeps all observations in x and y |
full_join() |
Left join is the most common and popular method, especially when we only need part of variables from another dataset. It preserves the original observations even when there isn’t a match - just make it as NA. This is also our best choice and we will show how it works in our case.
The final dataset should include predict variables grad.rate, withdraw, farms, exp and target variable pro. Three datasets with this structure will be created for df_ela, df_alg1 and df_alg2 respectively. As mentioned, we would focus on left join with the help of function left_join(). Let’s see an example for df_ela first:
temp <- df_ela %>%
select(6,1,5) %>%
left_join(df_grad[, c('grad.rate', 'id')], by = 'id')
head(temp)
## id county pro grad.rate
## 1 Allegany-Fort Hill High Allegany 40.9 82.86
## 2 Allegany-Allegany High Allegany 52.9 90.78
## 3 Allegany-Mountain Ridge High School Allegany 54.2 87.82
## 4 Anne Arundel-Glen Burnie High Anne Arundel 37.9 90.3
## 5 Anne Arundel-North County High Anne Arundel 49.5 86.18
## 6 Anne Arundel-Severna Park High Anne Arundel 83.1 97.5
Look! grad.rate in df_grad was added to df_ela successfully. Option by= enables you to specify the keys you would like to use. Then, we repeat this step for df_mob, df_spc and df_exp.
| Dataset | Variable Kept | Keys |
|---|---|---|
| df_grad | grad.rate: 4-year graduation rate |
id |
| df_mob | withdraw: students’ exit mobility |
id |
| df_spc | farms: percentage of students reciving free/reduced price meals |
id |
| df_exp | exp: quotient of wealth and expenditures |
county |
temp1 <- df_ela %>%
select(6,1,5) %>%
left_join(df_grad[, c('grad.rate', 'id')], by = 'id') %>%
left_join( df_mob[, c('withdraw', 'id')], by = 'id') %>%
left_join( df_spc[, c('farms', 'id')], by = 'id') %>%
left_join(df_exp, by = 'county')
Instead of repeating by ourselves, the R base package provides a function Reduce(), which can come in handy. Reduce() takes a function \(f\) of two arguments and a list or vector \(x\) which is to be reduced using \(f\). The function is first called on the first two components of \(x\), then with the result of that as the first argument and the third component of \(x\) as the second argument, then again with the result of the second step as first argument and the fourth component of \(x\) as the second argument etc. The process is continued until all elements of \(x\) have been processed.
L_ela <- list(df_ela[, c('id', 'county', 'pro')],
df_grad[, c('id','grad.rate')],
df_mob[, c('id','withdraw')],
df_spc[, c('id','farms')], df_exp)
temp2 <- Reduce(function(x,y) left_join(x, y,
by = colnames(y)[1]), L_ela)
# `by = colnames(y)[1])` tells how to find the *keys* in each left_join.
The above script shows how Reduce() function works in out case. First, these five files are save in list L_ela. Then function left_join() was applied one by one in list L_ela with function Reduce(). temp2 is the same dataset as temp1. So, let’s write it as a function and apply it to df_alg1 and df_alg2.
func_merge <- function(data){
L <- list(data[, c('id', 'county', 'pro')],
df_grad[, c('id','grad.rate')],
df_mob[, c('id','withdraw')],
df_spc[, c('id','farms')], df_exp)
final_data<- Reduce(function(x,y) left_join(x, y,
by = colnames(y)[1]), L)
return(final_data)
}
df_ELA <- func_merge(df_ela)
df_ALG1 <- func_merge(df_alg1)
df_ALG2 <- func_merge(df_alg2)
Wonderful! Finally we get the ideal datasets: df_ELA, df_ALG1 and df_ALG2. However, the size of df_ALG1 is greatly larger than the other two. Let’s have a look on it:
Do you remember we didn’t drop the middle school information before? Here comes the result - since left_join() preserves the original observations even when there isn’t a match. All observations in df_alg1 are kept, but we don’t have data of grad.rate, withdraw and farms since we have selected high school information in their wrangling section. Therefore NAs are generated by function left_join() for these middle schools. Besides, pro of middle schools are much higher than that of high schools. It is because students who take Algebra1 test in middle schoos are more likely to be good at this subject. This fact makes them incompariable with students who take Algebra 1 test in high schools. If we delete middle school information, variable pro is pretty lower for the left high schools, so it fails to reflect this schools’ academic achievements in Algebra 1. For df_ELA and df_ALG2, such problem doesn’t exist so we would like to work on these two datasets.